Tokyo Institute of Technology
Abstract:We prove a regret lower bound for Gaussian-process bandits on a smooth compact Riemannian manifold $\M$ of dimension $d$ with intrinsic Matérn-$ν$ kernel ($ν>d/2$) that exposes how the geometry of the arm space enters the constant. For any algorithm and time horizon $T$ exceeding an explicit threshold, the worst-case expected regret over the RKHS-ball $\|f\|_{\Hil_{k_ν}}\!\le\!B$ satisfies \begin{multline*} \E[R_T(f)]\;\ge\;c_*(d,ν)\,B^{d/(2ν+d)}\,σ_n^{2ν/(2ν+d)} \\ \cdot\,\vol_g(\M)^{ν/(2ν+d)}\,T^{(ν+d)/(2ν+d)}(\log T)^{ν/(2ν+d)}. \end{multline*} The exponent matches the Vakili--Khezeli--Picheny upper bound \cite{vakili2021information}; the $\vol_g(\M)^{ν/(2ν+d)}$ factor is, to our knowledge, the first explicit volume-dependent geometric constant in a manifold GP-bandit lower bound. We extend the analysis in five directions: (i)~a companion Assouad-style proof gives a different lower bound with a strictly smaller $T$-exponent $(2ν+3d)/(4(ν+d))$ but with a polylog factor of the form $1/(\log\log T)^{(2ν+d)/(4(ν+d))}$, sharpening the $(\log T)^{ν/(2ν+d)}$ Fano polylog of Theorem~\ref{thm:main}; (ii)~we prove a $|G|^{1/2}$ upper bound on the regret of an extrinsic-kernel GP-UCB algorithm on a quotient space $\M=\Mt/G$, plus a bracketing theorem (Theorem~\ref{thm:gauge-bracket}); the precise constant is conjectured to take the modulated form $(1+(|G|-1)h(\rinj/κ))^{1/2}$ (Conjecture~\ref{conj:gauge-modulated}), validated numerically on $\SO(3)$; (iii)~we write the leading constant $c_*(d,ν)$ out fully; (iv)~we extract a curvature dependence $1+O(K\eps_T^2)$ via Bishop--Gromov; (v)~we transfer the bound to the Bayesian regret framework via the Yang--Barron / Castillo et al.\ Bayesian-Fano transfer.
Abstract:Beam alignment in mmWave phased arrays and RIS-assisted links is a stochastic bandit under both short TTI budgets and Doppler-induced non-stationarity. The arm space is a Riemannian manifold: $\sphere^2$ for steering, $\torus^n$ for phase combining, $\SO(3)$ for panel orientation, or the discrete torus $(\mathbb Z_B)^M$ with up to $K\!\sim\!10^{90}$ configurations for $B$-level RIS ($B\!=\!2^b$, $b$ bits/element); the intrinsic Matérn kernel of Borovitskiy et al.\ provides the base GP. We contribute two algorithmic pieces. \textbf{(C1)} A Kronecker-factorised intrinsic-product Matérn kernel on $(\mathbb Z_B)^M$ evaluating in $O(M)$ table lookups, making GP-UCB tractable at $K\sim 10^{90}$ where the extrinsic alternative is infeasible. \textbf{(C2)} AdaptiveGP-v2, an online sliding-window controller that selects $W$ by per-sample marginal likelihood, with predictive-variance and drift $z$-score reset triggers and a post-reset $β$-boost. On a four-speed ($v\!\in\!\{0.02,0.08,0.12,0.20\}$~km/h), $20$-seed paired campaign at $T\!=\!3000$, AdaptiveGP-v2 is statistically indistinguishable from the hand-tuned fixed-window oracle at every speed (Holm--Bonferroni-corrected paired differences cross zero); the operational benefit is the absence of a deployment-time per-speed calibration step, not a mean-regret improvement. On four static 3GPP-style mmWave benchmarks, intrinsic-kernel GP-UCB reduces cumulative regret by $25$--$45\%$ vs.\ codebook UCB1/Thompson and by $10$--$33\%$ vs.\ Euclidean-ambient GP-UCB on the toroidal arm spaces; a wideband OFDM ablation on a $100$~MHz channel confirms the advantage persists under frequency-selective fading ($\sim\!32$~Mbps/UE at initial access vs.\ UCB1). A third-party-simulator sanity check on Sionna CDL is reported in Section~V.
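The $O(M)$ evaluation claimed in (C1) can be illustrated with a minimal sketch. It assumes, since the abstract does not give the construction, that the kernel on $(\mathbb Z_B)^M$ is a product of identical one-dimensional graph-Matérn kernels on the cycle $\mathbb Z_B$, each with spectral weights $(2ν/κ^2+λ_j)^{-ν}$ over the cycle-graph Laplacian eigenvalues $λ_j = 2-2\cos(2πj/B)$. The function names and the parameters `nu` and `kappa` are hypothetical, not taken from the paper.

```python
import math

def cyclic_matern_table(B, nu=1.5, kappa=1.0):
    """Lookup table t[d] = k(x, (x + d) mod B) for a graph-Matern kernel on the cycle Z_B.

    Assumption: spectral weights (2*nu/kappa**2 + lambda_j)**(-nu), where
    lambda_j = 2 - 2*cos(2*pi*j/B) are the cycle-graph Laplacian eigenvalues.
    """
    weights = [
        (2.0 * nu / kappa**2 + 2.0 - 2.0 * math.cos(2.0 * math.pi * j / B)) ** (-nu)
        for j in range(B)
    ]
    # The eigenvectors of the cycle graph are Fourier modes, so the kernel
    # depends only on the offset d and is a cosine series in d.
    table = [
        sum(w * math.cos(2.0 * math.pi * j * d / B) for j, w in enumerate(weights))
        for d in range(B)
    ]
    return [t / table[0] for t in table]  # normalise so that k(x, x) = 1

def product_kernel(x, y, table):
    """Product kernel on (Z_B)^M: one table lookup per coordinate, O(M) in total."""
    B = len(table)
    k = 1.0
    for xi, yi in zip(x, y):
        k *= table[(xi - yi) % B]
    return k
```

The table is built once at cost $O(B^2)$; each kernel evaluation afterwards touches only $M$ entries, so no $K\times K$ Gram matrix over all $K=B^M$ configurations is ever formed.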
Abstract:Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality--cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for MAS. AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: first, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality--cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.
Abstract:Text Image Machine Translation (TIMT) aims to translate text embedded in images from a source language into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
Abstract:Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.
Abstract:This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
Abstract:The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layouts, while their ability to understand long texts with complex layouts, though highly significant, is largely overlooked. In this paper, we propose Menu OCR and Translation Benchmark (MOTBench), a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish, along with its price and unit items on a menu, providing a comprehensive assessment of their visual understanding and language processing capabilities. Our benchmark comprises a collection of Chinese and English menus, characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations. Experiments show that our automatic evaluation results are highly consistent with professional human evaluation. We evaluate a range of publicly available state-of-the-art LVLMs and analyze their outputs to identify strengths and weaknesses in their performance, offering valuable insights to guide future advancements in LVLM development. MOTBench is available at https://github.com/gitwzl/MOTBench.




Abstract:Handling lengthy context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs) in applications such as processing high-resolution images or high frame rate videos. The rise in image resolution and frame rate substantially increases computational demands due to the increased number of input tokens. This challenge is further exacerbated by the quadratic complexity of the self-attention mechanism with respect to sequence length. Most prior works either pre-train models with long contexts, overlooking the efficiency problem, or attempt to reduce the context length via downsampling (e.g., identifying the key image patches or frames), which may result in information loss. To circumvent this issue while keeping the remarkable effectiveness of MLLMs, we propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications. Our multimodal model can effectively process long context input exceeding 100k tokens, outperforming existing models across various benchmarks. Remarkably, our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models, with efficiency gains increasing as image resolution or video frames rise. Furthermore, our model is the first to be trained on low-resolution images or low-frame-rate videos while being capable of inference on high-resolution images and high-frame-rate videos, offering flexibility for inference in diverse scenarios.




Abstract:Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks. However, fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, mitigate this issue by adjusting only a small subset of parameters. Nevertheless, these methods typically employ random initialization for low-rank matrices, which can lead to inefficiencies in gradient descent and diminished generalizability due to suboptimal starting points. To address these limitations, we propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to initialize low-rank matrices using critical singular values as trainable parameters. Specifically, SVFit performs SVD on the pre-trained weight matrix to obtain the best rank-r approximation matrix, emphasizing the most critical singular values that capture over 99% of the matrix's information. These top-r singular values are then used as trainable parameters to scale the fundamental subspaces of the matrix, facilitating rapid domain adaptation. Extensive experiments across various pre-trained models in natural language understanding, text-to-image generation, and image classification tasks reveal that SVFit outperforms LoRA while requiring 16 times fewer trainable parameters.
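The initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say whether a residual term is kept or how the singular vectors are stored, so the sketch simply freezes the top-$r$ singular vectors and treats the top-$r$ singular values as the only trainable vector. The function names `svfit_init` and `svfit_weight` are hypothetical.

```python
import numpy as np

def svfit_init(W, r):
    """Split a pre-trained weight matrix into frozen subspaces and trainable scales.

    Returns (U_r, sigma, Vt_r): U_r and Vt_r hold the top-r singular vectors
    (assumed frozen here); sigma holds the top-r singular values (assumed to be
    the only trainable parameters).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], s[:r].copy(), Vt[:r, :]

def svfit_weight(U_r, sigma, Vt_r):
    """Rebuild the rank-r weight U_r @ diag(sigma) @ Vt_r from the current scales."""
    return (U_r * sigma) @ Vt_r
```

Before any training step, `svfit_weight(*svfit_init(W, r))` is the best rank-$r$ approximation of `W` in Frobenius norm (Eckart--Young). In this sketch each adapted matrix carries $r$ trainable scalars, whereas a rank-$r$ LoRA update on an $m\times n$ matrix trains $r(m+n)$.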




Abstract:A timeline provides a total ordering of events and times, and is useful for a number of natural language understanding tasks. However, qualitative temporal graphs that can be derived directly from text -- such as TimeML annotations -- usually explicitly reveal only partial orderings of events and times. In this work, we apply prior work on solving point algebra problems to the task of extracting timelines from TimeML-annotated texts, and develop an exact, end-to-end solution which we call TLEX (TimeLine EXtraction). TLEX transforms TimeML annotations into a collection of timelines arranged in a trunk-and-branch structure. As in prior work, TLEX checks the consistency of the temporal graph and solves it; however, it adds two novel functionalities. First, it identifies the specific relations involved in an inconsistency (which could then be manually corrected) and, second, it identifies sections of the timelines that have indeterminate order, information critical for downstream tasks such as aligning events from different timelines. We provide detailed descriptions and analysis of the algorithmic components in TLEX, and conduct experimental evaluations by applying TLEX to 385 TimeML-annotated texts from four corpora. We show that 123 of the texts are inconsistent, 181 of them have more than one ``real world'' or main timeline, and there are 2,541 indeterminate sections across all four corpora. A sampling evaluation showed that TLEX is 98--100% accurate with 95% confidence along five dimensions: the ordering of time-points, the number of main timelines, the placement of time-points on main versus subordinate timelines, the connecting point of branch timelines, and the location of the indeterminate sections. We provide a reference implementation of TLEX, the extracted timelines for all texts, and the manual corrections of the inconsistent texts.
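The consistency-check-and-solve step can be illustrated with a much simplified sketch. It is not TLEX: it assumes a restricted point algebra with only the relations `<` and `=` between time-points, merges equal points with union-find, and then topologically sorts the strict-order edges. It returns one admissible ordering or `None` when the constraints are inconsistent; it does not locate the offending relations or the indeterminate sections. The function name `order_points` is hypothetical.

```python
from collections import defaultdict, deque

def order_points(points, relations):
    """Return time-points grouped into ordered equivalence classes, or None if inconsistent.

    relations: iterable of (a, rel, b) with rel in {"<", "="} (a simplifying assumption).
    """
    parent = {p: p for p in points}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge points declared equal into one class.
    for a, rel, b in relations:
        if rel == "=":
            parent[find(a)] = find(b)

    # Build strict-order edges between classes.
    succ = defaultdict(set)
    indeg = {find(p): 0 for p in points}
    for a, rel, b in relations:
        if rel == "<":
            ra, rb = find(a), find(b)
            if ra == rb:
                return None  # a < b contradicts a = b
            if rb not in succ[ra]:
                succ[ra].add(rb)
                indeg[rb] += 1

    # Kahn's algorithm: a cycle of "<" edges leaves some classes unvisited.
    queue = deque(sorted(r for r, d in indeg.items() if d == 0))
    order = []
    while queue:
        r = queue.popleft()
        order.append(sorted(p for p in points if find(p) == r))
        for nxt in sorted(succ[r]):
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order if len(order) == len(indeg) else None
```

When several classes are ready at the same step, this sketch breaks the tie alphabetically. Such ties are exactly where the input leaves the order undetermined, which is the information TLEX reports rather than hides.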